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Abstract 

The  Toolkit  is  a  family  of  hardware  modules  (processors,  mem¬ 
ory,  interconnect,  and  input-output  devices)  and  a  collection  of  soft¬ 
ware  modules  (compilers,  simulators,  scientific  libraries,  and  high-level 
front  ends)  from  which  high-performance  special-purpose  computers 
can  be  easily  configured  and  programmed.  The  hardware  modules  are 
intended  to  be  standard,  reusable  parts.  These  are  combined  by  means 
of  a  user-reconfigurable,  static  interconnect  technology.  The  Toolkit’s 
software  support,  based  on  novel  compilation  techniques,  produces  ex¬ 
tremely  high-performance  numerical  code  from  hi^-level  language  in¬ 
put,  and  will  eventually  automatically  configure  hardware  modules  for 
particular  applicatiois. 

We  have  completed  fabrication  of  the  Toolkit  processor  module,  and 
an  eight-processor  configuration  is  running  at  MIT.  As  a  demonstration 
of  the  power  of  the  Toolkit  approach,  we  have  used  the  prototype 
Toolkit  to  perform  an  integration  of  the  motion  of  the  Solar  System 
in  a  computation  that  extends  previous  results  by  nearly  two  wders  of 
magnitude. 

This  report  describes  research  done  at  the  Artificial  Intelligence  Labora¬ 
tory  of  the  Massachusetts  Institute  of  Technology.  Support  for  the  laborer 
tory’s  artificial  intelligence  research  is  provided  in  part  by  the  Advanced  Re¬ 
search  Projects  Agency  of  the  Department  of  Defense  imder  0£Bce  of  Naval 
Research  contract  N00014-89-J-3202  and  by  the  National  Science  Founda¬ 
tion  under  grant  number  MIP-9001651.  Andrew  Berlin’s  was  partially  sup¬ 
ported  by  an  IBM  fellowship. 
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The  Supercomputer  Toolkit: 

A  general  framework  for  special-purpose  computing^ 

Special-purpose  computational  instruments  will  play  an  increasing  role 
in  the  practice  of  science  and  engineering.  Although  general-purpose  super¬ 
computers  axe  becoming  more  available,  there  are  significant  applications  for 
which  it  is  appropriate  to  construct  special-purpose  dedicated  computing  en¬ 
gines.  The  Supercomputer  Toolkit  is  intended  to  make  the  construction  and 
programming  of  such  spedal-purpMe  computers  routine  and  inexpensive,  in 
some  cases  even  automatic. 

The  Toolkit  is  a  family  of  hardware  modules  (processors,  memory,  in¬ 
terconnect,  and  input-output  devices)  and  a  collection  of  software  mod¬ 
ules  (compilers,  simulators,  scientific  libraries,  and  high-level  front  ends) 
from  which  high-performance  special-purpose  computers  can  be  easily  con¬ 
figured  and  programmed.  The  hardware  modules  are  intended  to  be  stan¬ 
dard,  reusable  parts.  These  are  combined  by  means  of  a  user-reconfigurable, 
static  interconnect  technology.  The  Toolkit’s  software  support,  based  on 
novel  compilation  techniques,  produces  extremely  high-performance  numer¬ 
ical  code  from  high-level  language  input,  and  will  eventually  autonoatically 
configure  hardware  modules  for  particular  applications. 

Our  Supercomputer  Toolkit  is  intended  to  help  bring  scientists  and  en¬ 
gineers  back  into  the  design  loop  for  their  computing  instruments.  Tra¬ 
ditionally,  they  have  been  intimately  involved  in  the  development  of  their 
instruments.  Computers,  however,  have  been  treated  differently — scientists 
are  primarily  users  of  general-purpose  computers  supplied  by  a  few  remote 
vendors.  Because  a  general-purpose  computer  is  often  not  well-organized  for 
a  particular  problem,  the  construction  of  appropriate  software  can  be  a  long 
and  complex  task.  By  contrast,  a  computer  whose  design  is  specialized  to  a 
particular  problem  can  be  straightforward  to  program  for  the  application  for 
which  it  was  designed.  Moreover,  the  specialized  computer  can  become  an 
ordinary  experimental  instrument,  or  a  component  of  a  measurement  tool  or 
a  process-control  system,  belonging  to  the  group  that  made  it. 

Over  the  past  three  years,  the  MIT  Project  for  Mathematics  and  Com¬ 
putation  and  Hewlett-Packard’s  Information  Architecture  Group  have  been 
collaborating  on  the  design  and  construction  of  a. prototype  Supercomputer 

^with  ^>oiQS>es  to  Joel  Moses  [9]. 
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Toolkit,  and  a  system  configured  with  eight  processors  is  now  running  at 
MIT. 

This  prototype  is  targeted  at  ntimerical  computations  where  performance 
is  limited  by  the  need  to  integrate  systems  of  ordinary  differential  equations. 
Such  computations  are  characterized  by  a  bottleneck  in  scalar  fioating-point 
performance  rather  than  in  I/O  or  in  memory  bandwidth.  These  computa¬ 
tions  are  typically  not  easy  to  vectorize.  Highly  pipelined  vector  processors 
do  not  do  well  on  them  because  the  state  variables  of  the  system  must  in 
general  be  updated  by  computing  different  expressions. 

For  suitable  applications,  we  expect  the  Supercomputer  Toolkit  approach 
to  provide  substantial  advantages  over  general-purpose  supercomputers.  This 
is  not  only  because  of  the  dedicated  hardware,  but  because  the  Toolkit’s  novel 
compilation  strategy  generates  outstandingly  efficient  code  from  general- 
purpose  library  modiiles  and  programs  written  in  high-level  languages.  Bench¬ 
marks  integrating  systems  of  ordinary  differential  equations  confirm  that  a 
single  Tool^t  processor,  programmed  in  highly  abstract  Lisp  code,  achieves 
scalar  floating-point  performance  equivalent  to  a  Cray  IS  programmed  in 
Cray  Fortran.  While  this  does  not  match  the  speed  of  the  fastest  available 
supercomputer,  the  relative  price  advantage  of  the  Toolkit  allows  it  to  be 
used  for  applications  that  would  be  otherwise  infeasible.  We  have  already 
used  our  eight-processor  configt^ration  to  perform  a  computation  of  major 
scientific  significance — an  integration,  running  for  1000  hours,  of  the  com¬ 
plete  Solar  System,  which  improves  upon  previous  computations  by  almost 
two  orders  of  magnitude. 

In  general,  we  envision  that  the  Toolkit  will  be  used  as  follows:  One  begins 
with  an  algorithm  that  performs  the  costly  inner  loop  of  a  computation 
that  warrants  constructing  a  special-purpose  machine.  For  exanq>le,  the 
simulation  part  of  a  mrdtidimensional  optinuzation  in  the  design  of  an  analog 
circuit,  or  the  intimation  of  the  differential  equations  required  for  real-time 
control  of  a  nonlinear  process,  are  appropriate  for  Toolkit  implementation. 
The  Toolkit  software  compiles  the  program,  targeted  for  a  number  of  different 
Toolkit  hardware  configurations,  some  proposed  by  the  user,  others  generated 
automatically  by  the  Toolkit  compiler  itself.  The  compiler  also  produces,  for 
each  configuration,  a  simulation  that  the  user  can  nm  on  the  host  machine 
to  help  evaluate  price-performanoe  tradeoffs.  After  a  configuration  has  been 
selected,  the  user  obtains  the  required  modules,  wires  them  together,  and 
connects  the  machine  to  a  host  computer.  The  configuration  is  verified  by 
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means  of  diagnostics  that  are  automatically  generated  and  loaded  from  the 
host.  The  target  program  is  then  loaded,  and  the  new  machine  is  ready  to 
be  used  by  host  programs  as  a  back-end  processor. 

The  Toolkit’s  advantage  arises  from  two  architectural  principles,  cou¬ 
pled  with  supporting  compiler  technology.  The  first  principle  is  the  use  of 
problem-specific  communication  paths.  Starting  with  a  particular  algorithm, 
one  can  often  find  a  static  arrangement  of  interprocessor  communication 
paths  that  admits  nearly  optimal  utilization  of  processing  power  for  that 
algorithm.  Maxliines  with  specialized  communication  paths  are  currently 
built  to  implement  extremely  high-performance  signad-processing  systems, 
but  these  maudiines  are  expensive,  because  each  one  is  manually  configured 
and  programmed.  We  believe  that  for  many  important  scientific  applications, 
high-performamce  configurations  can  be  generated  in  a  straightforward  way, 
perhaps  even  automatically,  and  easily  constructed  and  programmed  using 
Toolkit  niodules. 

The  second  principle  is  the  use  of  synchronous  uHra-long  instruction  word 
machines.  Even  in  problems  that  are  not  vectorizable,  scientific  applications 
typically  have  substantial  data-independent  structure.  One  can  configure  a 
totally  synchronous  machine,  in  which  most  interprocessor  communication  is 
statically  scheduled  at  compile  time,  fri  effect,  the  multiple  VLIW  execution 
units  of  the  machine  are  programmed  as  a  single  ultra-long  instruction  word 
processor.  This  organization  eliminates  the  need  for  synchronization,  bus 
protocols,  run-time  handshaking,  or  any  operating-system  overhead. 

The  Toolkit’s  compiler  uses  a  novel  strategy  based  upon  partial  evalua¬ 
tion  [3, 4].  It  exploits  the  data-independence  of  typical  numerical  alf^rithms 
to  generate  exceptionally  efficient  object  code,  even  from  source  programs 
that  are  expressed  in  terms  of  highly  abstract  components.  This  has  enabled 
us  to  develop  a  library  of  symbolic  manipulation  conqwnents  to  support  the 
automatic  construction  of  simulation  codes.  As  a  measure  of  success,  our 
Solar-system  simulation  code,  constructed  with  this  library,  achieves  98% 
utilization  of  the  Toolkit’s  available  floating-point  performance. 

The  Toolkit  approach  has  obvious  limitations.  Neither  our  hardware  ar¬ 
chitecture  nor  our  interconnection  technology  can  be  expected  to  scale  to 
systems  with  many  hundreds  of  processors.  On  the  other  hand,  the  Toolkit 
does  realize  a  means,  practical  within  the  limits  of  current  technology,  to 
provide  relatively  inexpensive  supercomputer  performance  for  an  important 
class  of  problems. 
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Figvire  1:  Each  Todkit  proceaaor  module  haa  two  bidiKCtional  I/O  ports.  Any  inter- 
processor  connection  graph  can  be  configured  with  each  node  having  degree  up  to  about 
8.  The  figure  shows  how  to  build  various  network  architectures:  a  mesh,  a  ring  and 
communicating  clusters. 

The  prototype  Toolkit  is  hotised  in  a  minicomputer  chassis  borrowed  from 
an  existing  Hewlett-Packard  product  line  (HP9000/850).  Our  current  collec¬ 
tion  of  eight  processor  boards  consumes  about  1200  watts.  All  cables  for 
data,  clocks,  global  flag,  and  host  communication  connect  to  the  front  edge 
of  each  processor  board.  A  tiser  assembles  a  machine  by  plug^g  in  the  re¬ 
quired  modules  and  connecting  the  cables  appropriately.  When  a  particular 
machine  is  no  longer  needed,  it  can  be  disassembled,  and  its  modules  can  be 
reassembled  into  other  configurations. 

1.1  Toolkit  Processor  overview 

The  processor  design  was  guided  largely  by  two  factors:  the  class  of  prob¬ 
lems  we  were  targeting;  and  the  use  of  off-the-shelf  technology  available  in 
1989.  The  class  of  problems  is  the  solution  of  systems  of  ordinary  differential 
equations — computations  characterized  by  a  large  number  of  floating-point 
operations  on  a  relatively  small  amount  of  data.  Given  the  very  small  design 
team,  the -tedm^ogy-choioes  were  ocmfined  to  readily  available  commercial 
parts  No  custom  chips  are  used  in  the  design,  and  all  components  use  TTL 


logic  levels. 

The  processor  architecture  is  centered  around  a  high-performance  floating¬ 
point  chip  set.  The  remaining  hardware  is  designed  to  feed  operands  to  the 
floating-point  chips  without  interruption.  Figure  2  shows  a  block  diagram 
of  the  processor  with  the  ALU,  Multiplier,  register  file,  data  memories,  ad¬ 
dress  generators,  I/O  ports,  instruction  memory,  and  program  sequencer. 
Not  shown  in  the  diagram  are  the  host  interface,  clock  generator,  and  some 
smaller  buses. 

The  prototype  Toolkit  processor  contains  about  220  integrated  circuits 
laid  out  on  a  13  X  15  inch  printed-circuit  board  as  shown  in  figure  3.  The 
eight-layer  board  has  8-10  mil  traces  with  8  mil  spaces.  Trace  impedances 
range  from  45  ohms  to  80  ohms,  with  the  higher  impedance  layers  used  for 
the  critical  I/O  port  wiring.  The  cycle  time  is  84  ns,  which  results  in  an 
instruction  rate  of  11.9  million  instructions  per  second. 


1.2  Floating-point  unit 

The  heart  of  the  processor  is  a  pair  of  Bipolar  Integrated  Tedmology  (B.I.T.) 
floating-point  math  chips  [5].  The  ALU  chip  (B2120A-25)  does  all  arithmetic 
and  logical  operations  in  25ns.  The  Multiplier  chip  (B2110A-55)  performs 
double-precision  (64-bit)  multiplication  in  55ns.  Both  chips  have  32-bit  I/O 
paths  and  full  64-bit  internal  paths.  To  simplify  the  architecture  and  timing, 
all  operations  are  done  in  double  precision.  Operations  complete  in  one  c^cle 
except  for  the  Multiplier’s  divide  and  square  root,  which  are  multi-c^cle 
operations. 

Separate  instruction  bits  to  each  math  chip  allow  the  ALU  and  Multiplier 
to  be  controlled  independently.  The  peeik  floating-point  performance  is  there¬ 
fore  two  operations  per  cycle,  or  23.8  MFLOPS  per  processor  at  the  current 
84n8  clock  rate.  Section  2.7  below  describes  an  iimer-product  routine  that 
runs  at  this  peak  rate.  In  practice,  we  have  been  able  to  sustain  roughly  half 
this  rate — about  12  MFLOPS  per  processor— on  the  applications  of  interest. 

1.3  Buses  and  Register  File 

As  illustrated  in  figure  2,  the  tooUdt  processor  architecture  contains  a  central 
five-port  register  file  that  communicates  with  each  of  the  data  memories,  with 
the  math  chips,  and  with  the  I/O  ports.  The  five  ports  of  the  roister  file  are 
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Figure  2:  The  Toolkit  procewor  cooeuts  of  floating-point  unit  (ALU  and  Multiplier),  a 
regiater  flle,  two  data  memories,  each  with  its  own  address  generator,  two  bidirectional 
I/O  ports,  instruction  memory,  and  a  program  sequencer.  The  figure  shows  the  maj<» 
^******  T,  LEFT,  and  RIGHT,  Not  shown  are  the  side  paths,  the  internal  in 
the  Math  chips,  the  communication  between  the  sequencer  and  the  rest  of  the  marhin* 
and  the  host  interface. 
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Figlire  3:  The  Toolkit  processor  module  is  laid  out  on  a  13”  x  15”  printed-circuit  board. 
The  six  large  pin  grid  arrays  consist  of  the  four  register  files,  the  ALU  and  the  Multiplier. 
These  chips  occupy  the  center  strip  of  the  board  with  an  almost  synunetric  complement  of 
ICs  surrounding  them.  The  other  specialized  chips  are  the  left  and  right  address  generators 
in  the  68-pin  square  ICs,  and  the  microprogram  sequencer,  the  48-pin  dual  inline  package 
chip. 
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connected  to  five  buses.  The  X  and  Y  buses  carry  operands  from  the  register 
file  to  the  math  chips,  while  the  T  bus  retvims  results  produced  by  the  math 
chips  to  the  register  file.  The  LEFT  bus  coimects  the  roister  file  to  the  left 
memory  and  to  the  left  I/O  port,  while  the  RIGHT  bus  connects  the  register 
file  to  the  right  memory  and  to  the  right  I/O  port.  Even  though  each  I/O 
port  lies  on  the  same  bus  as  a  memory,  the  processor  timing  does  not  permit 
the  I/O  port  and  memory  to  communicate  directly  with  each  other;  instead, 
all  communication  is  performed  via  the  register  file. 

The  central  register  file  contains  thirty-two  72-bit  wide  registers,  to  ac¬ 
commodate  64-bit  IEEE  double-precision  floating-point  numbers  with  eight 
bits  of  parity  protection.  The  entire  register  file  is  constructed  from  four 
B.I.T.  register-file  chips  (B2210A),  each  of  which  provides  sixty-four  18-bit 
wide  registers.  These  four  chips  are  organized  as  two  dual-B2210 A  register- 
files,  which  hold  duplicate  data. 

Each  duaI-B2210A  is  a  pair  of  chips  cascaded  to  form  sixty-four  36-bit 
wide  registers.  To  obtain  72-bit  wide  registers,  the  36-bit  registers  are  used 
in  pairs,  one  register  holding  the  most  significant  half  of  the  word,  aind  the 
other  holding  the  least  significant  half.  Logically,  a  dual-B2210A  provides 
thirty-two  72-bit  registers,  with  two  read  ports,  two  write  ports,  and  one 
read/write  port,  where  each  port  is  36-bit8  wide. 

The  register  file  design,  which  calls  for  two  dual-B2210s  with  duplicate 
contents,  arises  from  the  desire  to  prevent  slow  memory-write  tinning  from 
limiting  the  clock  speed  of  the  processor.  The  math  chips  respond  quickly 
enough  to  allow  the  X,  Y,  and  T  buses  to  each  be  made  only  36-bit8  wide,  and 
used  twice  during  each  cycle  in  order  to  transfer  a  72-bit  quantity.  Similarly, 
the  memory  system  is  able  to  use  the  LEFT  and  RIGHT  buses  twice  per 
cycle  to  transfer  data  from  the  memories  to  the  register  file.  However,  the 
memory  chips  are  too  slow  to  allow  data  to  written  to  memory  twice  per 
cycle.  This  requires  effectively  making  the  LEFT  and  RIGHT  buses  72  bits 
wide  for  register-to-memory  transfers.  To  allow  for  72-bit  r^ister-to-memory 
transfers,  the  two  dual-B2210s  are  connected  as  shown  in  figure  4.  Whenever 
data  is  written  into  one  of  the  dual-B2210A8,  it  is  also  written  into  the  other, 
thereby  ensuring  that  the  contents  of  the  two  register  files  is  identical.  The 
second  register  file  provides  the  additional  read  port  r<!quired  to  access  both 
the  the  least-significant  36-bit8  and  the  most-signilicaat  36-bit8  of  a  word 
simultaneously. 

The  main  buses,  the  data  memories,  the  register  file,  the  I/O  ports,  and 
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Figure  4:  (s)  The  ceatral  register  file  is  constructed  from  BJ.T.  register-file  chips 
(B2210A)  with  two  rend  ports,  two  write  ports,  and  one  read/write  port,  (b)  Using  two 
pairs  of  chips  provides  the  additional  read  ports  (noarked  with  in  the  figure)  required 
perfisrm  72-biLregister«tOrjiienaor.y.  transfers.  These  ports  carry  the  lower  half  of  a  double 
predsion  value  while  the  Left  or  Right  bus  carries  the  upper  half. 
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the  data  cables  are  protected  by  one  parity  bit  for  each  byte  of  data.  Parity 
is  generated  and  checked  by  the  math  chips. 

In  addition  to  the  data  paths  shown  in  figure  2,  there  are  other  buses  in 
the  processor.  The  math  chips  have  a  64-bit  internal  recirculation  path  which 
leads  £rom  the  output  register  back  to  one  of  the  input  registers.  The  output 
of  one  math  chip  can  be  transferred  on  the  T  bus  to  the  input  register  of 
the  other  using  this  path.  Some  calculations  can  be  directly  chained  without 
going  back  through  the  register  file.  This  path  can  be  used,  for  example,  to 
pass  results  from  the  Multiplier  directly  to  the  ALU,  freeing  the  X  and  Y 
buses  to  bring  in  new  operands. 

There  are  two  16-bit  "side  paths”  that  permit  transfers  between  the  ad¬ 
dress  generators  and  the  floating-point  buses.  One  side  path  links  the  most 
significant  sixteen  bits  of  the  X  bus  with  the  left  address  generator.  Similarly, 
the  other  side  path  links  the  Y  bus  with  the  right  address  generator.  The 
side  paths  allow  floating-point  values  to  be  passed  to  the  address  generators 
for  use  in  address  computations.  In  the  other  direction,  the  side-path  buses 
permit  constants  in  the  address-generator’s  immediate  field  to  be  driven  onto 
the  X  and  Y  buses,  to  be  used  as  operands  by  the  math  chips. 

1.4  Data  Memory 

To  keep  the  system  balanced  with  respect  to  bus  bandwidth  and  to  keep 
the  math  chips  as  busy  as  possible,  we  chose  to  have  two  independent  data 
memories.  Each  holds  16K  double-predsion  words  for  a  total  of  256K  bytes 
on  each  processor.  The  data  memory  is  implemented  with  20n8  16K  x  4-bit 
static  RAMs.  The  RAMs  are  accessed  once  in  each  instruction  ^cle. 

Each  memory  has  an  associated  address  generator.  Every  cycle  the  ad¬ 
dress  generator  produces  a  new  data  address  which  is  used  in  the  next  cycle 
to  access  memory.  The  address  generators  are  implemented  with  a  simple 
16-bit  single-chip  microprocessor  ALU  slice  (IDT49C402).*  The  chip  con¬ 
tains  a  general  pxirpose  ALU  backed  by  a  64-entry  address-register  file.  The 
address  registers  can  be  used  to  store  base  addresses  or  to  index  through 
arrays.  Immediate  values  are  supplied  to  the  address  generators  from  the 
instruction  word. 

^This-chipand-thesequeBcer'Chip  ace  16-bit  CMOS-vemoosof  the  fumiliiu-  AMD  2901 
micropioceaaor  slice  and  the  AMD  2910  microprogram  controller. 
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1.5  Sequencer  and  Instruction  Memory 

Overall  program  flow  is  controlled  by  a  sequencer  chip  (IDT49C410).  The 
sequencer  produces  a  new  instruction  address  every  cycle,  as  specified  by  the 
current  instruction  and  conditions  gathered  from  elsewhere  in  the  processor. 
Conditional  jumps  are  based  on  a  number  of  floating-point  flags,  the  global 
flag,  and  a  flag  from  the  host  processor.  The  processor  is  pipelined  such  that 
in  every  cycle,  the  sequencer  calculates  the  address  of  the  instruction  that 
will  be  executed  two  cycles  later.  Thus  a  branch  instruction  issued  in  cycle 
N  will  take  effect  in  cycle  N  +  2.  The  instruction  directly  following  a  jump 
instruction  is  always  executed. 

E)ach  instruction  controls  the  operation  of  the  four  main  functional  units: 
floating-point,  left  and  right  address  generators,  and  sequencer.  Instructions 
are  168  bits  (21  bytes)  in  length,  implemented  with  the  same  20ns  16K  x  4-bit 
parts  as  the  data  RAM.  A  processor  can  hold  16K  instructions  or  336Kbytes. 


1.6  Input /Output  Ports 

The  I/O  ports  are  connected  to  the  Left  and  Right  buses,  along  with  the 
the  data  memories  and  register  file.  Each  port  consists  of  a  double  precision 
register  and  a  very  high  current  (160mA)  TTL  transceiver  (74F306^).  The 
transceiver  drives  the  interboard  ribbon  cables  with  the  upper  and  lower  half 
of  a  data  word  in  a  time  multiplexed  fashion. 

Each  port  transmits  a  word  between  processors  in  two  cycles.  Interboard 
communication  begins  (in  cycle  1)  when  some  processor  transfers  the  con¬ 
tents  of  a  register  to  its  I/O  port.  During  cycles  2  and  3,  the  transmitting 
port  drives  the  cable  with  the  upper  and  lower  halves  of  the  data  word.  Also 
during  cycle  3,  the  receiving  processors  move  the  word  from  the  I/O  port 
to  a  register.  Since  there  is  no  hardware  arbitration  on  the  I/O  ports,  the 
programmer  must  develop  a  convention  for  controlling  access  to  each  com¬ 
munication  channel. 

The  interboard  connections  are  designed  to  avoid  reflections  and  achieve  a 
“first  wavefront”  communication  path.  To  avoid  reflections,  the  transmission 
lines  are  wired  point  to  point  with  no  branches.  On  the  board  they  are  routed 
in  a  continuous  path  from  an  input  connector,  to  the  transceiver,  and  then 
to  an  output  connector.  Cables,  connectors,  board  traces,  and  terminators 
all  have  an  impedance  of  80  ohms.  The  ports  are  implemented  with  open- 
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collector  drivers  to  prevent  program  errors  from  damaging  the  hardware. 


1.7  Clocks 

Clocks  for  the  entire  system  are  generated  from  a  single  crystal  on  a  separate 
clock/host-interface  board.  Copies  of  the  master  clock  are  carefully  buffered 
and  distributed  via  coax  cables  to  each  processor.  Clock  skew  is  held  to  ±1 
ns  by  using  buffers  from  a  single  IC  package  and  matched-length  cables.  This 
margin  is  quite  adequate  for  our  design. 

The  processors  use  a  tapped  delay  line  to  create  controlled  clock  edges 
at  7ns  intervals.  Two  programmable  logic  arrays  combine  various  taps  to 
create  a  multitude  of  clock  waveforms.  Clocks  are  driven  by  high-current 
TTL  gates  (74F1804)  and  are  parallel-terminated  with  about  80  ohms.  This 
strategy  gave  us  acceptable  control  of  skews  among  the  various  clocks,  but 
the  very-high  edge  rates  probably  <x>ntributed  to  problems  we  encountered 
with  clock  bounce  during  debugging. 

This  clocking  methodology  proved  to  be  very  flexible  during  the  design 
phase.  It  promoted  reliable  design  of  setup  and  hold  times.  A  wide  variety  of 
clocking  requirements  for  different  off-the-shelf  parts  were  easy  to  accommo¬ 
date.  On  the  other  hand,  the  large  number  of  clocks  were  complex  to  design, 
modify,  terminate,  and  control. 

1.8  Host  Interface 

Processors  communicate  with  the  host  workstation  via  a  16-bit  general  pur¬ 
pose  parallel  interface.  This  is  a  serial  protocol  that  runs  at  the  speed  of 
the  host  processor,  about  1  ^sec  for  each  transaction.  Eight  of  the  sixteen 
bits  are  used  to  address  one  or  more  processor  boards.  The  remaining  eight 
bits  contain  data  or  control  information.  On  the  processor,  the  host  interface 
connects  to  a  29-byte-long  scan  path  that  threads  through  the  instruction  reg¬ 
ister  and  address  registers.  It  is  constructed  from  8-bit  AMD29818  scannable 
registers  that  are  daisy-chained  togethn.  Shifting  data  along  the  scan  path 
provides  a  bidirectional  interface  with  the  host. 

This  interface  is  very  slow  compared  to  the  rest  of  the  machine,  and  it 
restricts  the  useful  range  of  applications  of  our  prototype  machine  to  those  in 
which  only  a  small  amotmt  of  data  is  transferred  between  the  processors  and 
the  host.  This  admits  significant  computations  (simulations  using  ordinary 
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differential  equations),  but  we  are  aware  of  the  limitations  imposed  here.  The 
host  interface  is  one  of  the  first  areas  of  the  design  that  should  be  improved. 

2  Toolkit  Low-level  Programming  Model 

The  Supercomputer  Toolkit  processor  is  programmed  as  a  very-long  instruc¬ 
tion  word  (VLIW)  computer.  The  programmer  has  full  control  of  all  of  the 
hardware  resotirces  just  disctissed.  Thus,  in  every  cycle,  the  following  oper¬ 
ations  can  be  performed  in  parallel: 

•  Two  floating-point  operations,  one  in  the  ALU  and  one  in  the  Multi¬ 
plier.  The  ALU  and  Multiplier  share  input  and  output  ports  so  two 
values  can  be  fetched  from  the  register  file  and  one  result  can  be  written 
back. 

•  Two  memory-I/0  bus  transactions,  one  on  the  Left  bus  and  one  on  the 
Right  bus.  Data  is  exchanged  between  memory  and  the  main  register 
file,  or  between  the  register  file  and  the  I/O  port. 

•  Two  address  computations,  one  in  the  left  address  generator  and  one 
in  the  right  address  generator.  The  addresses  will  be  used  to  access 
the  data  memories  in  the  following  cycle.  The  address  generators  have 
internal  register  files  to  support  these  operations. 

•  One  sequencer  operation,  to  generate  an  instruction  address.  Typical 
sequencer  instructions  include  conditional  branch,  jump,  continue,  and 
call. 

•  Two  flags  may  be  set:  the  global  flag  goes  to  neighboring  processors; 
the  host  flag  goes  to  the  host  workstation. 

To  program  the  Toolkit  at  this  level,  we  use  a  primitive  symbolic  assembly 
language.  An  instruction  appears  as  a  list  of  up  to  six  tagged  fields: 

((flop 

(lalo  ...)  (xmlo  ...) 

(lag  ...)  (rag  ...) 

(saquancar  ...) 

(flags  ...)) 
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The  fields  may  be  listed  any  order.  The  assembler  supplies  default  values  for 
omitted  fields. 

2.1  Floating-point  Operations 

The  ALU  and  Multiplier  operate  simultaneously.  Each  Toolkit  instruction 
has  a  separate  opcode  field  for  the  ALU  and  Multiplier.  Any  ALU  opcode 
can  appear  together  with  any  Multiplier  opcode.  Most  floating-point,  integer, 
and  logical  operations  require  one  qrcle.  Divide  requires  4  cycles  and  square 
root  requires  7  cycles.  The  entire  list  of  available  operations  can  be  fotmd  in 
the  B.LT.  data  sheet  for  this  chip  set  [5]. 

The  following  six-cycle  sequence  illustrates  the  pipelining  of  these  oper¬ 
ations,  as  well  as  the  assembler  syntax  for  the  floating-point  portion  of  a 
Toolkit  instruction.  As  illustrated  here  and  in  the  examples  to  follow,  the 
processor  pipeline  is  controlled  partly  in  hardware  and  partly  in  software. 
For  example,  floating-point  opcodes  are  spedfied  in  the  same  instruction  as 
the  operand  register  numbers.  Pipeline  registers  hold  and  delay  the  opcodes 
while  the  register  read  takes  place.  Then,  the  opcodes  and  operands  enter 
the  math  chips  together  on  the  following  cycle.  The  command  to  latch  the 
math  chip  result  register  (klatch)  is  specified  in  another  instruction,  and  the 
command  to  write  the  remit  back  to  the  renter  file  is  specified  in  a  third 
(t  ...). 

Cycle  Instruction 

1  ((flop  (x  rlO)  (y  rli)  (*  dnult  x  y))) 

2  ((flop  (x  rl2)  (y  ri3)  (*  dnult  x  y  ftlatch) 

(+  dadd  X  y))) 

3  ((flop  (x  rl4}  (y  ri5)  (*  dnult  x  y  dlatch) 

(t  *  rS)  (,*  dadd  t  z  ftlatch))) 

4  ((flop  (*  tlatch)  (<*^  tlatch))) 

6  ((flop  (t  *  r6))) 

6  ((flop  (t  ♦  r7))) 

The  instruction  in  cycle  1  places  the  contents  of  ru  on  the  X  bus,  the 
contents  of  vu  on  the  Y  bus,  and  passes  the  double-precision  multiply  opcode 
to  the  Multiplier. 
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In  cycle  2,  the  multiplier  computes  the  product  requested  in  cycle  1,  and 
then  latches  the  result  into  its  result  register.  Also  during  cycle  2,  instructions 
and  data  are  passed  to  the  Multiplier  and  ALU  requesting  both  the  product 
ri2ri3  and  the  sum  +  ri3. 

In  cycle  3,  the  product  rjorn  is  driven  from  the  Multiplier  onto  the  T 
bus  and  is  written  into  rs.  This  value  is  also  specified  as  an  operand  for 
the  next  ALU  operation.  Simultaneously,  the  product  ri2ri3  and  the  sum 
ri3  +  ri3  are  computed  and  latched  into  the  result  registers  of  the  Multiplier 
and  ALU.  Data  for  the  next  multiplication  is  transferred  from  registers  ri4 
and  ris.  The  second  ALU  operand,  +  ri3,  is  transferred  over  the  ALU’s 
internal  feedback  path  from  the  result  register  back  to  an  operand  register. 

In  (^cle  4  the  new  sum  (riorn  +  ri2  +  ri3)  is  computed  and  latched  by 
the  ALU.  The  Multiplier  computes  and  latches  ri4ri5. 

In  cycle  5,  the  ALU  resiilt  is  written  to  re  via  the  T  bus. 

In  cycle  6  the  Multiplier  result  computed  and  latched  during  cycle  4  is 
written  to  rr. 


2.2  Bus  Operations 

Each  of  the  two  memory-I/0  btises  can  perform  a  64-bit  transfer  either  be> 
tween  registers  and  memory,  or  between  registers  and  an  I/O  port.  There  are 
no  direct  register  to  register  operations  or  memory  to  memory  operations. 
Register-memory  transfers  require  an  address  to  have  been  generated  during 
the  previous  cycle.  For  example,  the  instruction 

(CLsio  ■->r  r23)  (rmlo  r->*  rl7)) 

loads  r33  with  the  contents  of  the  left-memory  address  and  stores  ri7  into  the 
right  memory  address. 

2.3  Input /Output  Operations 

Communication  between  boards  is  accomplished  by  transferring  64-bit  quan¬ 
tities  between  registers  and  the  I/O  ports.  For  example,  suppose  that  the 
left  I/O  port  of  Processor  1  is  connected  to  the  right  port  of  Processor  2  and 
to  the  left  port  of  Processor  3.  This  code  fragment  transmits  the  contents  of 
rio  in  Processor  1  to  rjo  in  Processor  2  and  to  ras  in  Processor  3: 
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Cycla  Procassor  1  Procaaaor  2  Procassor  3 

1  (Imio  r*>io  rlO)  ... 

2  ...  ...  ... 

3  ...  (xmio  io>>r  r20)  (Imio  io->r  r25) 


2.4  Address  Generation 

E^ch  of  the  two  address  generators  produces  a  new  result  every  cycle,  which 
is  used  as  the  memory  address  for  the  following  cycle.  Each  address  generator 
has  its  own  internal  ALU,  and  64  16'bit  registers  to  store  addresses.  Each 
operation  may  take  its  sources  from  the  registers,  from  a  16-bit  immediate 
fidd  in  the  instruction  word,  or  from  the  main  register  file  via  the  side  paths. 

ALU  operands  are  denoted  by  A  and  B,  the  constant  field  by  D,  a  zero 
source  by  Z.  Available  ALU  operations  include  addition,  subtraction,  and 
Boolean  operations.  The  result  of  an  operation  can  be  passed  to  the  address 
generator  output,  and  may  be  written  back  to  the  internal  address-register 
selected  by  B. 

The  syntax  of  the  address-generator  field  of  a  Toolkit  instruction  is  shown 
in  figure  5.  Here  are  a  few  examples: 

(lag  (add  dz  nop  lov)  (d  1117))  The  left  address  generator  performs 
the  addition  of  D  and  Zero  with  carry-in  set  low.  The  result,  1117, 
is  passed  to  the  output,  without  writing  it  back  to  the  address-register 
file  (nop). 

(rag  (add  zb  rastf  high)  (b  ag-rO))  This  performs  pre-increment  indexed 
addressing  using  register  ag-rO  as  the  index  register.  The  right  address 
generator  increments  the  contents  of  ag-rO  by  adding  it  to  zero  with 
the  carry-in  set  high.  The  incremented  result  becomes  the  address- 
generator  output  and  is  stored  back  into  ag-rO  as  specified  by  the 
destination  operation  rasf . 

(rag  (add  zb  rana  high)  (a  ag~rO)  (b  ag-rO))  This  performs  post-increment 
addressing.  It  increments  the  contents  of  the  B  address-r^ister  and 
stores  the  result  in  the  B  address-r^ister,  thereby  updating  the  con¬ 
tents  of  ag-rO.  This  time,  the  rana  operation  specifies  that  the  address 
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<flald>  (  <lag  I  rag>  <ag-op>  ) 

<«g-op>  (  <aoiire«-op>  <aoiireu>  <d«at-op>  <eazry>  ) 

<a-valii«>  <b-Talii«>  <d-valu«> 

<soiire«-op>  add  |  or  I  and  i  azor  I  ... 

<aoiireaa>  ab  I  zb  |  za  I  da  I  dz  I  ... 

<daat-op>  nop  I  raaf  I  man  |  orag  I  ... 

<eazry>  lo«  I  blgh 

a-aalna  ::■  (  a  <ag-ragiatar  0  -  1S>  ) 

b-valna  ::*  (  b  <ag-ragiatar  0  -  16>  ) 

d-ralna  ::•  (  d  <conataat  0  -  66635>  ) 

Figure  5:  Syntax  of  the  addteaa  generator  fielda.  A  full  liat  of  operationa  can  be  found 
in  the  documentation  for  the  addteaa  generator  chip,  IDT49C402  [T]. 
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generator  output  should  be  the  A-source,  which  is  the  (unincremented) 
contents  of  ag-rO. 

2.5  Instruction  Sequencing 

The  sequencing  operations  [7]  include  conditional  branches,  subroutine  calls, 
and  looping,  all  noiaintained  through  a  33-deep  on-chip  stack.  For  exam¬ 
ple,  loops  are  implemented  using  the  sequencer  operations  push  and  rfct. 
Push  pushes  the  next  address  onto  the  internal  stack,  and  sets  an  internal 
counter  to  the  number  of  loop  iterations  minus  one.  Rfct  decrements  the 
counter  and  keeps  branching  to  the  saved  address  until  the  count  becomes 
negative,  at  which  point  it  pops  the  stadc  and  falls  through  the  loop.  Se¬ 
quencer  operations  are  pipelined  and  take  effect  one  cycle  after  they  appear 
in  the  instruction  stream.  For  example,  the  first  instruction  in  the  body  of 
a  push/rf  ct  loop  will  be  two  cycles  after  the  appearance  of  the  push  in  the 
instruction  stream.^ 

Figure  6  shows  a  subroutine  that  uses  the  loop  instructions  to  upload 
to  the  host  a  known-length  vector  by  repeatedly  calling  the  upload  subrou¬ 
tine,  which  uploads  the  floating-point  number  in  rl.  The  vector  is  stored  in 
consecutive  locations  in  left  memory.  The  caller  initializes  the  left  address- 
register  named  ptr  to  point  to  the  first  of  these  locations.  The  vector  length 
minus  one  the  constant  n-1,  which  must  be  known  at  assembly  time. 

The  program  example  shown  in  figure  7  embeds  upload^vector  in  an 
outer  loop  to  produce  a  routine  that  uploads  all  elements  of  an  n  x  n  matrix 
(where  n  is  known  at  assembly  time).  The  inner  loop  (over  the  elements  of  a 
row)  is  counted,  as  above,  in  the  sequencer.  The  outer  loop  (over  the  rows)  is 
counted  by  incrementing  (with  the  ALU)  the  register  count-up-rovs,  which 
is  initialized  to  — n,  and  branching  bade  so  long  as  the  result  is  n^ative. 
In  keeping  with  the  convention  that  the  ALU  is  used  for  floating-point  op¬ 
erations,  we  assume  that  the  initial  (floating-point)  value  — n  is  stored  in  a 
known  location  address-of-ninus-n,  and  that  one  is  a  roister  containing 
the  (floating-point)  value  1.  We  assume  that  the  matrix  is  stored  so  that 
Mij  is  at  a  base  address  plus  txrow-increment-i-i  where  row-increment  is  a 
constant  specified  at  assembly  time. 

*The  inatroctioD  directly  followiiig  a  jump  is  referred  to  as  the  slot”  in  modem 
reduced  inatruetkm  set  computera. 
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Cyela 

(label  npload-Yaetor) 

1  ( (saquanear  puli  traa  ii-l))  :do  loop  n  tiaas 

::6aiiarata  addraas  <or  first  vaetor  alaaant  daring  tha 
:  :dalay  slot  for  tba  pub: 

2  ((lag  (add  zb  nop  lov)  (b  ptr))) 

: ;Bodj  of  loop  is  epelas  3  throogb  6 

:  ;Ilova  vaetor  alanant  to  B1 

3  ((laio  ■->r  rl) 

(saqaanear  ejs  traa  apload))  ;Babroatina  call  to  opload 

4  0  ;dalay  slot  for  tba  ejs 

6  ((saqaanear  rfet)) 

::ineraMat  tba  pointar  daring  tba  delay  slot  for  tba  rfet: 

6  ((lag  (add  zb  raaf  bigb)  (b  ptr))) 

7  ((saqaanear  ertn  traa))  :ratara  froB  opload-vaetor 

8  ()  (delay  slot  for  tba  ertn 

Figure  6:  A  Toolkit  subroutine  to  upload  a  vector  to  the  host,  using  a  simple  loop 
counter  in  the  sequencer.  Note  in  this  example  that  the  sequencer  operatimis  push,  rfet, 
ejs  (conditional  jump  to  subroutine),  and  ertn  (conditional  return  frexn  subroutine)  ate 
here  all  conditionalized  with  true,  so  that  they  always  take  effect. 


21 


The  routine  is  called  with  the  left-address-generator  register  matrix-base 
pointing  to  Moo.  It  uses  left-address-generator  registers  rov-pointer  to  hold 
a  pointer  to  the  beginning  of  successive  rows,  ptr  to  hold  the  pointer  to  the 
individual  elements  Mij. 

The  upload-matrix  program  uses  a  conditional  jtimp  based  upon  a  flag 
that  signals  whether  the  floating-point  result  is  negative.  Flags  are  set  by 
the  ALU  or  the  Multiplier  when  they  drive  the  T  bus. 

2.6  Multiprocessing  and  Synchronization 

Since  all  Toolkit  processors  are  driven  by  a  single  master  clock,  the  individual 
processors  are  electrically  synchronized  in  terms  of  when  qrcle  boundaries  oc¬ 
cur.  However,  each  processor  has  its  own  sequencer  and  instruction  memory, 
so  the  operations  performed  by  the  processors  are,  in  general,  independent 
of  one  another.  When  data  is  transferred  among  processors  the  programmer 
must  arrange  that  the  receivers  all  grab  the  data  two  cycles  after  the  trans¬ 
mitter  puts  it  out.  This  can  be  passively  arranged  by  cycle  counting  or  by 
an  explicit  synchronization  action. 

The  one-bit  global  flag  can  be  used  to  maintain  synchronization  among 
processors  during  the  course  of  a  computation.  One  programming  technique 
is  to  maintain  synchronization  between  processors  at  all  times  by  having  each 
processor  execute  the  same  instruction  sequence:  All  processors  branch  si¬ 
multaneously,  based  on  the  condition  being  asserted  on  the  global  flag,  in 
effect,  behaving  as  if  there  was  one  central  controller  governing  all  proces¬ 
sors.  This  approach  works  well  on  vectorizable,  data-independent  problems, 
but  breaks  down  when  local  data-dependent  dedsions  must  be  made.  A 
more  complex  alternative  is  to  allow  each  processor  to  ocecute  branches  in- 
dq>endently  based  on  its  own  local  data,  with  synchronization  occurring  only 
occasionally  when  communication  is  required. 

For  example,  a  step  in  a  multiprocessor  program  might  require  each  pro¬ 
cessor  Pi  to  perform  a  computation  Cj,  where  different  C,-  may  require  dif¬ 
ferent  numbers  of  cycles.  In  this  case,  the  global  flag  could  be  used  as 
a  “busy**  indicator:  As  long  as  each  processor  is  busy,  it  asserts  the  flag. 
When  all  processors  are  done,  they  release  the  flag,  informing  all  processors 
that  they  can  proceed  to  the.  next  phase  of  the.computation.  .The.subroutine 
vait-f  or-all-boards,  shown  in  figure  8,  can  be  us^  to  implement  this  kind 
of  synchronization.  Any  processor  calling  this  routine  will  remain  in  a  wait 
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(labfll  npload-aatrix) 

; ;  Initializs  ontar-loop  by  initializing  ro*-pointar  in  tha  lag 
to  tha  start  of  tha  first  ros.  In  tba  saaa  cycla,  nsa  tbs  rag 
to  ganarata  tba  addrass  for  initializing  tba 
; ;  connt-np~ross  ragiatar. 

((rag  (add  dz  nop  lo«)  (d  addrass-of ^diins-n) ) 

(lag  (add  za  razf  los)  (a  natriz-basa)  (b  ros-pointar))) 

: ;  Znitializa  connt-np-ross 
((raio  ■->r  connt-up-ross)) 

(labal  ontar-loop) 

((saqnancar  pnsb  tma  n-1))  :7aady  to  connt  tba  innar  loop 

: ;  Xnitializa  ptr  to  ro«-pointar 

((lag  (add  za  raaf  Ion)  (a  ro«-pointar)  (b  ptr))) 

;;lazt  four  eyclas  fora  tba  innar  loop,  as  in  upload  vactor 
((laio  a->r  rl)  (saqnancar  ejs  tma  upload)) 

0 

((saqnancar  rfct)) 

((lag  (add  zb  raaf  big)i)  (b  ptr))) 

;;Xncraaant  connt-np-rovs,  using  tba  AL0 
((flop  (z  connt-np-rovs)  (y  ,ona)  (a  dadd  z  y))) 

;;Latcb  tba  rasnlt.  In  tbs  saaa  cycla,  incraaant  rov-pointar  in 
::tba  lag  to  point  to  tba  start  of  tba  nszt  row. 

((flop  (+  ftlatcb)) 

(lag  (add  da  raaf  lov)  (a  rov-pointar)  (b  rov-pointar) 

(d  rov-incraaant))) 

:;Stora  tbs  nav  valns  of  connt-i^-rovs,  and  jnap  back  to  tba 
:;start  of  tba  ontar  loop  if  tba  rasnlt  is  still  nagativa 
((flop  (t  +  connt-np-rovs)) 

(saqnancar  cjp  nag  rov-loop)) 

()  ;dalay  slot  for  tba  cjp 

((saqnancar  crtn  tma))  ;ratnm  fron  npload-natriz 

0  slot  for  tba  crtn 


.  Figure?:  XfaisDbolkit  program-uploads  a matriz,  onig- nested  loops. 
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Cycle 

(label  vait-for-all-boards) 

1  ((sequencer  Idct  true  vait-lor-all-boards-done)) 

2  ((sequencer  cjp  true  vait-for-all-boards-assert-ready)) 

(label  «alt-for-all-boards-assert-ready) 

3  ((sequencer  jrp  global  vait-for-all-boards-assert-ready) 

(flags  (global  .  lo«))) 

(label  vait-for-all-boards-done) 

4  0 

5  ((sequencer  crtn)) 

6  0 

Figure  8:  This  routine  uses  the  global  flag  to  synchionue  processors  so  that  they  will  all 
return  in  the  same  cycle  when  the  flag  is  deasserted  by  all  boards.  The  Toolkit  assembler 
default  for  the  flags  field  asserts  the  global  flag,  ensuring  that  it  will  remain  asserted 
until  all  processors  execute  the  instruction  in  cycle  3.  As  each  processor  calls  the  routine, 
it  waits  at  (3)  in  a  1-cycIe  loop  and  deasserts  the  flag.  The  loop  is  implemented  with  the 
sequencer  jzp  conditional  jump  instruction,  which  branches  either  to  the  address  specified 
in  the  instruction,  or  to  the  address  stored  in  the  internal  counter,  depending  on  the  state 
of  the  flag.  The  counter  register  is  loaded  by  the  Idct  instruction  in  cycle  1.  Note  that 
a  one-cycle  loop  is  attained  even  though  the  sequencer  is  pipelined,  by  including  a  jump 
instructicm  (3)  in  the  *^ump  slot"  of  a  previous  junq>  (2).  Interested  readers  should  trace 
through  the  control  structure  here  in  detail,  noting  that  when  the  loop  terminates,  the 
no-op  at  4  is  executed  twice. 

loop  until  all  processors  have  called  the  routine,  whereupon  all  processors 
return  in  the  same  cycle. 

2.7  Benchmark  Examples 

Figure  9  shows  a  single-processor  program  that  computes  the  inner  product 
of  two  vectors  at  the  processor’s  peak  speed  of  two  floating-point  operations 
per  qrcle  (23.8  double-precision  MFLOPS  at  the  current  clock  rate).  The 
two  vectors  are  stored  in  memory,  one  in  the  left  memory  and  one  in  the 
right  memory.  The  routine  is  pipelined  to  use  a  two-cycle  loop  that  does  two 
multiplications  and  two  additions. 
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((sequencer  push  true  n*-l)  ;  vector  length  *  2n 

(flop  (x  rlO)  (y  rlO)  (*  dmult  x  y))  ;  c(rl0)-0 

(lag  (add  dz  ramf  high)  (d  lbase-1)  (b  al)) 

(rag  (add  dz  ramf  high)  (d  rbase-1)  (b  al))) 

((flop  (x  rlO)  (y  rlO)  (*  dnult  x  y  hlatch)  (4^  dadd  x  y)) 

(laio  n->r  rl)  (rmio  m->r  r2) 

(laig  (add  zb  ramf  high)  (b  al)) 

(rag  (add  zb  ramf  high)  (b  al))) 

: :  The  next  tvo  cycles  are  the  inner  loop. 

((sequencer  rfct) 

(flop  (x  rl)  (y  r2)  (t  ♦) 

(e  dmult  X  y  Jblatch)  (4  dadd  t  z  tlatch)) 

(laio  m->r  r3)  (rmio  m->r  r4) 

(lag  (add  zb  ramf  high)  (b  al)) 

(rag  (add  zb  ramf  high)  (b  al))) 

((flop  (x  r3)  (y  r4)  (t  ♦) 

(*  dmult  X  y  tlatch)  (4  dadd  t  z  ftlatch)) 

(laio  a->r  rl)  (rmio  a->r  r2) 

(lag  (add  zb  ramf  high)  (b  al)) 

(rag  (add  zb  ramf  high)  (b  al))) 

((flop  (t  e)  (e  tlatch)  (4  dadd  t  z  tlatch))) 

((flop  (t  *)  (4  dadd  t  z  tlatch))) 

((flop  (4  tlatch))  (sequencer  cjp  true  done)) 

((flop  (t  4  rS)))) 

(label  done) 


Figure  9:  This  routine  computes  the  inner  product  of  two  vectors  of  length  2n  at  the 
processor’s  peak  speed  of  2  floating-point  operations  per  cycle.  One  vector  is  stored  in 
left  memory  beginning  at  location  l-t-lbase-l,  the  other  in  right  memory  beginning  at 
location  1-l-rbase-l.  The  inner  loop,  which  does  two  multiplications  and  two  additions, 
is  executed  n  times. 
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Another  Toolkit  demonstration  program  (code  not  shown  here)  solves  n  x 
n  systems  of  linear  equations  Ax  =  b  by  means  of  Gauss-Jordan  elimination 
with  full  pivoting,  using  the  algorithm  given  in  [12].  The  running  time  of  the 
algorithm  is  dominated  by  the  row  reduction  performed  each  time  a  pivot 
is  selected.  This  involves  subtracting  a  multiple  of  the  pivot  row  from  each 
other  row  of  the  matrix: 

DO  21  LL-l.H 

IF  (LL.NE.ICQL)  THEN 
DUM«A(LL.ICOL) 

A(LL.IC0L)-0.0 
DO  18  L-l.H 

A(LL,L)-A(LL,L)-A(ICOL.L)*DUM 
18  CONTINUE 

B (LL) -B (LL) -B (ICOL) «DUM 
ENDIF 

21  CONTINUE 

The  Toolkit  implementation  runs  the  inner  loop  (DO  loop  18)  at  a  rate  of 
one  floating-point  operation  per  cyde  (11.9  MFLOPS).  This  is  accomplished 
by  holding  the  matrix  A  in  left  memory,  but  copying  the  pivot  row  A  [ICOL ,  *] 
into  right  memory  when  the  pivot  is  chosen.  Using  both  memories  provides 
enough  bandwidth  to  schedule  a  floating-point  operation  on  every  cyde  of 
the  inner  loop. 

The  bottleneck  in  this  computation  is  the  sheired  floating-point  result  bus 
which  is  shared  by  the  Multiplier  and  ALU.  Thus,  even  though  subtractions 
aad  multiplications  could  be  done  simultaneously,  only  a  single  result  can 
be  written  to  the  registers  on  each  cyde.  In  contrast,  the  inner-product 
program  can  sustain  two  floating-point  operations  per  cyde  because  it  uses 
the  ALU  result  register  to  hold  the  partially  computed  sum,  and  writing  to 
the  registers  is  not  required. 


3  High-level  Programming  Model 

The  Toolkit  software  environment  indudes  a  compiler  that  converts  numeri¬ 
cal  routines  written  in  a  high-level  language  (the  Scheme  dialect  of  Lisp  [6]) 
into  extremely  efficient,  highly  pipelined  Toolkit  programs.  These  compiled 
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routines  can  be  combined  with  hand-written  assembly  language  programs  by 
means  of  a  programmable  linker. 

The  Toolkit  compilation  methodology  requires  the  programmer  to  di¬ 
vide  programs  into  data-independent  computations.  A  data-independent 
computation  is  one  in  which  the  sequence  of  operations  is  not  dependent 
on  the  particular  numerical  values  of  the  data  being  manipulated.  For  in¬ 
stance,  the  sequence  of  mxUtiplications  and  additions  performed  in  a  Fast 
Foiirier  Transform  is  independent  of  the  numerical  values  of  the  data  being 
transformed.  The  compiler  generates  Toolkit  instructions  for  a  program’s 
data-independent  computations,  leaving  the  programmer  to  implement  the 
data-dependent  branches  in  assembly  language. 

3.1  Compiling  Data-independent  Computations 

The  first  stage  of  the  compilation  process  uses  the  partial  evaluation  tech¬ 
nique  described  in  [3]  and  [4]  to  extract  the  underlying  numerical  computa¬ 
tion  from  the  high-level  program.  Unlike  high-level  language  programs  which 
may  include  procedure  calls  and  complex  data  structure  manipulations,  the 
optimized  programs  produced  by  partial  evaluation  consist  entirely  of  numer¬ 
ical  operations.  This  purely  numerical  program  is  further  optimized  using 
traditional  compiler  techniques  such  as  constant-folding,  dead-code  eUmiaa- 
tion,  and  common-expression  elimination  to  produce  a  graph  of  numerical 
operations  to  be  performed. 

The  second  stage  of  the  compiler  maps  the  graph  onto  the  particular 
architecture  of  the  Toolkit  processor  to  generate  actual  Toolkit  instructions. 
This  process  has  two  phases.  First,  the  compiler  chooses  a  preliminary  order 
of  operations  to  reduce  memory  traffic  with  no  r^ard  for  actual  pipeline 
delays.  This  order  serves  as  advice  to  the  second  phase,  attempts  to  fill  all 
available  slots  in  the  pipeline.  It  also  imposes  some  other  constraints.  For 
example,  if  two  operands  are  to  be  loaded  at  the  same  time,  the  compiler 
attempts  to  assign  them  to  different  memory  banks. 

This  two-phase  strategy  seems  to  give  very  good  performance.  It  com¬ 
bines  traditional  ‘^esults-up  scheduling,”  in  which  the  data-flow  graph  of  the 
computation  is  analyzed  to  determine  the  ordering  of  the  instructions  that 
will  maximize  theLimmediate  reuse  ofintermediate-results,  with  '‘cyde-based 
scheduling,”  which  works  forward  through  the  program  from  the  operands 
towards  the  results,  choosing  the  order  of  the  instructions  incrementally  in 
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an  attempt  to  keep  the  processor  pipeline  full. 

Compiled  programs  bear  little  resemblance  to  hand-coded  programs.  In 
hand-written  programs,  pipeline  stages  relating  to  a  particular  operation, 
such  as  a  miiltiply,  tend  to  be  located  close  to  each  other.  In  contrast,  the 
compiled  code  spreads  computations  out  over  time  in  order  to  fill  in  pipeline 
slots.  The  compiler  schedules  loads  and  stores  retroactively,  intentionally 
placing  them  as  far  back  in  time  as  possible,  to  leave  memory  access  op¬ 
portunities  available  for  later  instructions.  Indeed,  loading  a  register  from 
memory  may  occur  dozens  of  cycles  before  the  operand  is  used.  Partial  eval¬ 
uation  makes  this  extreme  lookahead  possible  by  providing  huge  basic  blocks 
of  straight-line  niunerical  code.  The  basic  blocks  produced  by  the  partial 
evaluator  are  so  large  that  any  non-trivial  register  allocation  technique  could 
be  expected  to  do  well. 

The  compiled  code  tends  to  be  extremely  efficient,  even  when  generated 
from  very  high-level  source  code.  As  a  simple  example,  figure  10  shows  a 
highly  abstract  definition  of  a  vector  addition  procedure  add-vectors,  im¬ 
plemented  by  applying  to  the  addition  operator  a  general  transformation 
vector-elementvise,  which  converts  an  n-ary  scalar  function  /  into  an  n- 
ary  vector  function  that  applies  /  to  the  corresponding  elements  of  n- vectors, 

=  (vi,V2,,..),v*  =  (vi,v|,... =  (vi,V3,...),  and  produces  the 
vector  of  results  (/(v^, . . . ,  vf ),  /(vj, . . . ,  v, ), . . .).  This  implementation  of 
add-vectors  is  inefficient  in  most  Scheme  implementations.  Since  the  pro¬ 
cedures  operate  on  vectors  of  any  length,  there  must  be  a  run-time  loop  that 
counts  through  the  vector  index  (here  implemented  in  generate-vector). 
An  even  greater  source  of  inefficiency  is  the  fact  that  the  arity  of  the  pro¬ 
cedure  argument  f  is  not  known  to  vector-elenentvise,  which  therefore 
miist  explicitly  construct  lists  for  f  at  run  time.  The  compilation  is  specified 
by  applying  add-vectors  to  two  vectors,  each  consbting  of  four  placehold¬ 
ers  (see  [3]).  The  placeholders  represent  the  inputs  to  the  compiled  program. 
Specifying  the  number  of  vectors  and  vector  length  permits  the  compiler  to 
generate  code  after  all  tests  and  operations  on  data  structures  are  performed 
at  compile  time.  This  leaves  only  the  actual  component  additions  as  the  only 
"real  work”  to  be  performed  at  nm  time. 

Figure  11  shows  the  compiled  output.  The  entire  computation,  including 
moving  data  to  -and  ficom  menmry,  is  accomplished  in  nine  cycles.  This 
density  of  "useful”  operations— 4  additions  in  9  cycles — is  unt3q>ically  low 
for  compiler  output,  because  the  program  is  so  short. 
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(dafin*  (▼actor-alaaantsisa  1) 

(lambda  (vactors) 

(ga&arata-Tactor 
(Taetor-langth  (car  vactors)) 

(lambda  (i) 

(apply  i  (map  (lambda  (▼)  (vactor-ral  t  i)) 
vactors)))))) 

(daliaa  (gaaarata-vactor  siza  proc) 

(lat  ((ana  (maka-vactor  siza))) 

(lat  loop  ((i  0)) 

(il  (a  i  siza) 

ans 

(bagin  (vactor-satl  ans  i  (proc  i)) 

(loop  (+  i  1))))))) 

(daflna  add-vactors  (alamantsisa  4^)) 

(dalina  vactor-1 

(vactor  (maka-placaboldar  'vactor-l-alamant-D 
(maka-placaboldar  *vactor-l-^lamant-2) 
(maka-placaholdar  'vactor-l-^lamant-3) 
(aaka-placaboldar  *vactor-l-alamsnt-4))) 

(dalina  vactor-2 

(vactor  (maka-placaholdar  *vactor-2-alomaat-l) 
(maka-placaholdar  *vactor-2-alamant-2) 
(maka-placaholdar  *vactor-2-alamant-3) 
(maka-placaholdar  *vactor-2-alamant-4))) 

(add-vaetors  vactor-1  vactor-2) 


Figure  10:  This  highly  abstract  implaxnentation  of  add-vactors  works  for  an  arbitrary 
number  of  vectors  of  arbitrary  length.  Using  placeholders,  the  definition  is  specialised  to 
add  two  vectors  of  length  four. 
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Figure  12  shows  a  straightforward  recursive  implementation  of  the  FFT, 
based  upon  some  suitable  representation  of  complex  arithmetic  (for  example, 
representing  a  complex  number  as  a  list  of  its  real  and  imaginary  parts). 
Even  though  the  program  is  simple,  written  with  admost  no  concessions  to 
efficiency,  the  compiler  transforms  this  into  highly  efficient  code.  A  128-point 
FFT  requires  4222  cycles  (0.354  msec,  at  our  clock  rate),  performing  3716 
floating-point  operations,  plus  the  required  moves  to  and  &om  memory.^  This 
is  still  a  rather  low  density  of  floating-point  operations,  compared  with  our 
experience  with  ordinary  differential  equations  (see  section  4).  For  the  FFT 
computation  in  particular,  one  can  do  much  better  than  this  by  using  the 
ALU  and  Multiplier  together  in  inner  product  computations.  The  present 
compiler,  however,  does  not  schedule  the  math  chips  simultaneously.  Given 
the  contention  for  the  floating-point  result  bus,  we  do  not  expect  to  obtain 
a  significant  speed  improvement  in  general.  Inner  product  computations,  of 
course,  are  an  important  exception.  See  the  comments  in  section  2.7. 

3.2  The  Programmable  Linker 

Currently,  the  compiler  handles  only  straight-line  code,  and  permits  calls  to 
a  library  of  hand-coded  numerical  subroutines.  This  restriction  is  severe  but 
tolerable  for  many  important  applications — ^many  numerical  programs  consist 
of  long  segments  of  straight-line  code  separated  by  just  a  few  run-time  tests. 
For  example,  integrating  a  system  of  ordinary  differential  equations  over  an 
interval  with  an  explicit-formula  integrator  such  as  Runge-Kutta  requires 
nm-time  tests  only  to  determine  if  the  end  of  the  interval  has  been  reached, 
and,  for  an  adaptive  integrator,  whether  to  adjust  the  step-size.  In  contrast, 
an  implicit-formula  integrator  such  as  Backward  Euler  requires  the  solution 
of  systems  of  linear  equations  to  choose  good  pivots,  thereby  generating 
data-dependent  operations.  This  is  currently  beyond  the  capability  of  the 
compiler,  and  the  Toolkit  niunerical  subroutine  library  includes  a  hand-coded 
linear-equation  solver  that  can  be  used  in  conjunction  with  compiled  code. 

Combining  compiled  and  hand-written  assembly  code  is  accomplished 
with  a  programmable  linker.  As  simple  example,  we  show  how  to  construct 
Toolkit  program  that  integrates  a  system  of  ordinary  differential  equations 

12ft-pouit  FFT  it  amaD,  but  th«  compiled  block  can  now  be  called  at  a  aubroutine 
in  computing  larger  FFTa. 
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((flop)  (lag  (add  dz  nop  Ion)  (d  6))  (rag  (add  dz  nop  Ion)  (d  1))) 
((flop) 

(lag  (add  dz  nop  Ion)  (d  7)) 

(rag  (add  dz  nop  Ion)  (d  3)) 

(iBio  ■->r  r20) 

(raio  ■->r  rl9)) 

((flop  (+  dadd  z  y)  (z  r20)  (y  rl»)) 

(lag  (add  dz  nop  Ion)  (d  4)) 

(rag  (add  dz  nop  Ion)  (d  0)) 

(laio  ■->r  r3) 

(raio  a->r  r26)) 

((flop  i*  tlatch  dadd  z  y)  (z  r3)  (y  r26)) 

(lag  (add  dz  nop  Ion)  (d  6)) 

(rag  (add  dz  nop  Ion)  (d  2)) 

(laio  a->r  r9) 

(raio  a->r  r8)) 

((flop  (+  glateh  dadd  z  y)  (z  r9)  (y  r8)  (t  *  r20)) 

(lag  (add  dz  nop  Ion)  (d  8)) 

(rag  (add  dz  nop  Ion)  (d  8)) 

(laio  B<->r  rlO) 

(raio  B->r  rl7)) 

((flop  (.*  tlatek  dadd  z  y)  (z  rl9)  (y  rl7)  (t  r3)) 

(lag  (add  dz  nop  Ion)  (d  9)) 

(rag  (add  dz  nop  Ion)  (d  9)) 

(laio  r->a  r20) 

(raio  r->a  r20)) 

((flop  (+  ftlateli)  (t  +  r9)) 

(lag  (add  dz  nop  Ion)  (d  10)) 

(rag  (add  dz  nop  Ion)  (d  10)) 

(laio  r->a  r3) 

(raio  r->a  r3)) 

((flop  (t  +  rl9)) 

(lag  (add  dz  nop  Ion)  (d  11)) 

(rag  (add  dz  nop  Ion)  (d  11)) 

(laio  r->a  r9) 

(raio  r->a  r9)) 

((flop)  (laio  r->a  rl9)  (raio  r->a  rl9)) 


Figure  11:  Tfak  ia  tbe  compiled  oatput  for  the  program  shown  in  figure  10. 
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(d«lin«  (lit  data-list) 

(l«t  ((roots  (roots-of-milty  (longtli  data-list) ) ) ) 

(doflno  (intoxiULl-llt  dsta-list) 

(lot  ((n  (longtli  data-list) ) ) 

(eond  ((a  &  1)  dota-llst) 

((odd?  a)  (oxror  "FFT  iaput  aot  a  pooor  at  2")) 

(also 

(lot  ((oTon-toxma  (intoxnal-llt  (ovoaa  data-list))) 
(odd-toxBO  (**  (iataxaal-llt  (odds  data-list)) 
roots))) 

(ajq^ond  (aap  coKploz^  ovaa-tozas  odd-taras) 

(ai^  eoaplax-  aaaa-taras  odd-taras))))))) 
(lataraal-llt  data-list))) 

(daliaa  (w*  data  roots) 

(if  (aall?  data) 

*0 

(eeas  (eoaplara  (ear  data)  (ear  roots)) 

(a«  (edr  data)  (edr  roots))))) 

(dafiaa  (aaaas  list) 

(il  (aall?  list) 

’() 

(eoas  (car  list)  (ataas  (eddr  list))))) 

(dafiaa  (odds  list) 

(if  (aall?  list) 

'() 

(eoas  (esdr  list)  (odds  (eddr  list))))) 


Figure  12:  A  straightfomrd  FFT  implementatioo,  which  is  transfenned  by  the  compiler 
into  effidant  Toolkit  code. 
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using  Runge-Kutta  integration.  The  program,  running  on  the  Toolkit,  reads 
&om  the  host  an  initisd  state  and  the  number  of  steps  to  integrate,  performs 
the  integration,  and  uploads  the  final  state  to  the  host.  This  is  all  straight- 
line  code  except  for  coimting  the  number  of  steps. 

Figure  13  shows  the  Scheme  source  code  to  be  compiled.  (The  details 
of  the  procedure  ode-system  are  not  shown — ^this  is  any  procediue  that  ac¬ 
cepts  a  state  vector  and  generates  the  vector  of  derivatives.)  Even  though  the 
formulation  of  the  Rimge-Kutta  integrator  is  in  terms  of  abstract  vector  op¬ 
erations,  this  will  exact  no  performance  penalty.  The  compiler  uses  the  spec¬ 
ification  of  the  input  placeholder  data  structure  to  specialize  all  operations, 
in  this  case,  to  vectors  of  length  four.  The  compiler  output,  which  integrates 
the  system  for  one  step,  is  generated  by  the  call  to  compile-kit-program 
(shown  in  the  figure)  which  combines  the  input  placeholder  data  structure 
with  the  Scheme  program  that  defines  the  computation. 

Figure  14  defines  some  address  and  data  structures  used  by  the  linker 
to  handle  the  compiled  output.  The  first  four  define  expressions  cause 
the  linker  to  allocate  memory  for  the  number  of  int^ation  steps,  for  a 
counter,  and  for  the  constants  0  and  1.  The  next  two  define  expressions 
allocate  storage  for  the  inputs  and  outputs  of  the  compiled  computation. 
Figure  15  shows  how  the  compiler  output  is  included  as  a  subroutine  with 
hand-written  assembly  code.  The  def  ine-program  expression  generates  the 
program  text  that  will  be  passed  to  the  assembler.  After  some  initialization, 
the  program  downloads  the  number  of  integration  steps  and  the  initial  state 
to  the  Toolkit,  and  nins  a  simple  loop  that  calls  the  compiler-generated 
integrate-one-step  subroutine.  The  assembly  code  is  written  in  a  noixture 
of  primitive  assembly  instructions  and  simple  assembly-language  macros. 

Note  that  this  assembled  Toolkit  program  could  be  loaded  onto  multiple 
Toolkit  boards,  producing  a  system  that  will  perform  multiple  integrations 
simultaneously,  starting  from  different  initial  states. 

It  is  amusing  to  observe  that  this  style  of  pasting  together  hand-written 
and  compiled  code  inverts  the  traditional  role  of  high-level  and  low-level 
programming.  Ordinarily,  one  writes  code  in  a  high-level  language  that  may 
call  hand-written  assembly-code  subroutines  for  speed-critical  applications. 
Here,  we  write  the  inner-loops  in  a  high-level  language,  and  compile  them 
for  inclusiaa.as  subroutines,  called  firom.hand-.written.assembly  code. 


33 
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; ;  :Rimg«-Kntta-4  iatagntor 
(daliaa  (rkatap  i  stata  h) 

(lat*  ((h*  (seala-vactor  h)) 

(kO  (h*  «  atata))) 

(kl  (k*  (1  (add-aaetora  atata  (1/2*  kO))))) 

(k2  (ba  a  (add-vaetora  atata  (1/2*  kl))))) 

(k3  (b*  (i  (add-aaetora  atata  k2))))) 

(add-aaetora  atata 

(1/6*  (add>aactora  kO  (2*  kl)  (2*  k2)  k3))))) 


: : :  Thia  ia  any  Ixuetion  tbat  takaa  a  atata  aactor 
and  ratnzna  tba  aactor  of  dariaatiaaa. 

(dafina  (oda-ayatan  atata)  ...) 

Va  apaeify  tba  atapalza  bara. 

(dafina  (intagrata-ona-atap  atata) 

(rkatap  oda-ayataai  atata  l.a-6)) 

Tbia  plaeaboldar  atmetnra  apaeializaa  tba  coa^ntation 
to  a  foor-dlnanaloxial  atata. 

(dafina  inpat-placaboldara 
(▼actor  (naka-onknoan  ’xO) 

(naka-nnknoan  *zl) 

(naka-anknoan  *z2) 

(naka-anknoan  'z3))) 

(eoaq^ila-klt-prograa 

"intagrata-^na-atap"  ;aana  aaad  by  linkar  for  tba  raaaltlng  progran 
Inpat-placaboldara  ; inpat  atractara 

intagrata-ona-atap  ;Scbana  dafinition  of  program  to  eoBq>ila 
traa)  ;tbia  flag  dafinaa  antomatically  ganarataa  coda 

;tbat  copiaa  tba  eoapatation'a  oatpata  back  to 
;ita  inpnta  —  bandy  for  loopa 


Figure  13:  Thia  Schama  deflnitiooa  generate  compiled  the  code  for  intagrata-ona-atap, 
which  performa  one  atep  of  a  foarth<order  Runge>Kutta  integration.  The  oda-ayatam  ia 
any  Scheme  ptocedore  that  aecepta  a  vectw  aa  argument  and  retnma  the  CMreqmnding 
vector  of  derivativea.  The  input  placeholder  data  atmctnrecanaea  the  compiler  to  apedaliae 
the  program  to  vectora  ^  length  four. 
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(dalln*  intagrator-limit-addraas  (alloeata-dnal-amory  1)) 
(dafina  intagratoT'-coantar-addrass  (allocata-dnal-aaBory  1)) 


(dalisa  zaro-aoaxca  (dual-saaMry-conataat-locatioa  0.0)) 
(dalina  ona-aonrca  (doal-aiaaory-eoiiatant-locatioa  1.0)) 

(dalina  Intagrata-ona-atap-inimta 

(coarpatation-inpat-data-atnietiiza-'aitli-addraaa-poiiitara 
intagrata-ona-atap) ) 

(dalina  intagiata-ona-atap-outpnta 

(coapntation-ontpat-data-atrnctiira-aitli-addraaa-pointara 
intagrata-ona-atap) ) 


Figure  14:  Definitiona  uaed  by  the  linker  in  handling  the  compiled  leault  of  the  aouice 
code  ahown  in  figure  13. 

4  Example:  A  Breakthrough  in  Solar-System  Inte¬ 
grations 

Even  though  our  prototype  Toolkit  has  been  operational  for  only  a  few 
months,  we  have  already  used  it  to  perform  a  computation  of  major  scien¬ 
tific  significance — an  integration  of  the  complete  Solar  System  that  improves 
upon  previous  computations  by  nearly  two  orders  of  magnitude. 

One  of  our  motivating  examples  in  embarking  on  the  Toolkit  project 
was  our  experience  with  the  Digital  Orrery  [2].  The  Orrery,  constructed  in 
1983-1984,  is  a  spedai-purpose  numerical  engine  optimized  for  high-precision 
numerical  integrations  of  the  equations  of  motion  of  small  numbers  of  grav¬ 
itationally  interacting  bodies.  Using  1980  technology,  the  device  is  about  1 
cubic  foot  of  electronics,  dissipating  150  watts.  On  the  problem  it  was  de¬ 
signed  to  solve,  it  is  measured  to  be  60  times  faster  than  a  VAX  11/780  with 
FPA,  or  1/3  the  speed  of  a  Cray  1S.‘  In  1988,  Sussman  and  Wisdom  [11]  used 
the  Orrery  to  demonstrate  that  the  long-term  motion  of  the  planet  Pluto, 
and  by  implication  the  dynamics  of  the  Solar  System,  is  chaotic.  The  posi¬ 
tions  of  the  outer  planets  were  integrated  for  a  simulated  time  of  845  million 

*With  the  arrival  ci  the  Supercomputer  Toolkit,  the  Orrery  waa  ofBcially  retired,  and 
waa  traaafisrred  to  the  Smithaonian  Inatitution  in  Auguat,  1991. 
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(d«f iaa-prograa 
'( 

((■•qaancar  cjp  tzu«  start-of-prograa) ) 

() 

: ; include  standard  libraries  here 

: : include  the  conpiler  output  as  a  subroutine 
(label  integrate-one-step) 

, C(inelude-coaputation  integrate-one-step) 

((sequencer  ertn  true)) 

() 

(label  start-ol-prograa) 

::code  that  ini^ia'izes  various  constants  dolined  by  the  linker 
(label  restar.) 

;:do«nload  nunber  of  steps  Iron  the  host 
,d(dynaad.c-dosnload-both-nanories  integrator-linit-address) 
::doBnload  the  initial  state 

,<l(dynaaic-dosnload-data-strueture  integrate-one-step-inputs) 

(label  integrator-restart) 

, d(dual-neaory-eopy 

(aake-dual-copy-tpecilier  zero-source 

integrator-counter-address) ) 

(label  integrator-loop) 

,C(junp-il-equal  integrator-linit-address 

integrator-counter-address 
*  retum-ansver ) 

;:othervlse,  increnent  counter  by  one: 
,d(dual-nenory-lalu-nonory-operation  'DADD 

one-source 

integrator-counter-address 
integrator-counter-address ) 

.•(cjs  * integrate-one-step) 

,C(Juaq>  'integrator-loop) 

(label  retxim-ansver) 

,0(dynanic-upload-data-stmeture  integrate-step-outputs) 

,C(Junp  'integrator-restart))) 


Figliie  15:  The  conq>iled  code  generated  by  the  program  in  figure  13  is  combined  with 
hand-generated  assembly  code,  producing  a  program  that  integrates  a  system  of  ODEs  for 
a  given  number  of  steps,  starting  with  an  initial  state. 
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years,  a  computation  in  which  the  Orrery  ran  continuously  for  more  than 
three  months.  Before  the  Orrery,  high-precision  integrations  over  simulated 
millions  of  years  were  prohibitively  expensive,  and  astrophysicists  had  done 
only  a  few  small  experiments  using  carefully  scheduled  resources. 

It  was  natural,  then,  that  our  first  task  for  the  new  Toolkit  was  to  dupli¬ 
cate  some  of  the  Orrery’s  outer  planet  integrations.  This  allowed  us  to  check 
out  and  debug  the  Toolkit  on  a  real  problem.  We  quickly  implemented  a 
high-order  multistep  integrator  of  the  Stormer  type — ^written  in  Scheme  and 
compiled  using  the  Toolkit  compiler — and  discovered  that  each  board  of  the 
Toolkit  was  about  three  times  faster  than  the  entire  Orrery  on  this  program. 
Some  of  this  speedup  was  because  the  Orrery  did  not  have  a  single  instruc¬ 
tion  to  compute  square  root  or  divide.  In  the  Orrery,  such  operations  were 
performed  by  Newton’s  method,  using  a  table  to  get  initial  approximations. 

Wisdom  and  Holman  [13]  then  developed  a  new  kind  of  symplectic  in¬ 
tegrator  for  use  in  Solar-System  integrations.  This  integrator  allows  the 
particles  to  advance  as  if  they  were  only  in  the  field  of  the  Sun.  It  then 
corrects  their  velocities  by  adding  the  change  in  momentum  accrued  over  the 
drift  time  by  perturbations  of  the  other  planets.  This  integrator  is  about  a 
factor  of  ten  faster  than  the  traditional  Stormer’s  method,  but  it  is  not  as 
accurate  over  short  timescales.  Over  long  timescales  it  appears  not  to  accrue 
too  much  error. 

With  the  speedups  available  from  the  Toolkit  and  the  new  integrator, 
we  performed  a  100-million-year  integration  of  the  entire  Solar  System  (not 
just  the  outer  planets,  as  with  the  Orrery),  incorporating  a  post-Newtonian 
approximation  to  General  Relativity  and  corrections  for  the  quadrupole  mo¬ 
ment  of  the  Earth-Moon  system.  The  longest  previous  such  integration  [10], 
recently  published,  was  for  about  3  million  years. 

The  restilts  of  our  integration  verify  the  principal  results  of  the  Orrery 
integrations.  We  also  found  good  evidence  for  chaotic  motions  in  the  inner 
Solar  System.  In  particxilar,  we  found  a  Lyapunov  exponent  for  the  inner 
Solar  System  similar  to  that  predicted  by  Laskar  [8]  in  his  integrations  of 
the  averaged  equations  of  motion.  We  also  verified  Laskar’s  prediction  that 
certain  pendulum-like  resonant  variables  alternate  between  circulation  and 
libration.  A  more  complete  analysis  of  the  data  from  the  integration  will  be 
forthcoming. 

Our  integration  iised  eight  Toolkit  boards  running  in  parallel.  Each  board 
ran  a  Solar  System  with  slightly  different  initial  conditions.  The  evolving 
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differences  were  compared  to  estimate  the  largest  Lyapunov  exponent.  The 
differences  in  initial  conditions  were  chosen  to  be  1  part  in  10^^  in  one  coor¬ 
dinate  of  a  planet.  We  foxmd  that,  the  chaotic  amplification  of  this  difference 
is  such  that  a  1  cm  difference  in  the  position  of  Pluto  at  the  beginning  of  the 
integration  produces  a  1  Astronomical  Unit  difference  in  the  position  of  the 
Earth  at  the  end  of  the  integration. 

Programming  the  integration  proceeded  essentially  along  the  lines  of  the 
simple  Runge-Kutta  example  presented  in  section  3.2.  The  integrator  and 
the  force  law  were  written  as  high-level  Scheme  programs.  The  accumulation 
of  position  was  implemented  in  quad  precision  (128  bits),  amd  the  required 
quad  precision  operators  were  written  in  Scheme.^  The  Scheme  source  was 
compiled  with  the  Toolkit  compiler,  and  the  resulting  routines  were  combined 
with  a  small  amount  of  Toolkit  assembly  code,  similar  to  figure  15.  ^  The 
compiled  code  contains  almost  10,000  Toolkit  cycles  for  each  integration  step, 
and  more  than  98  percent  of  the  cycles  perform  floating-point  operations. 

The  host-side  control  program  was  also  written  in  Scheme.  The  Toolkit 
was  downloaded  with  initial  conditions.  It  was  then  repeatedly  run  for  10^ 
7.2-day  integration  steps,  with  the  state  uploaded  to  the  host  at  the  end  of 
each  10^  step  segment — ^this  took  about  12  minutes  of  real  time.  The  100 
million  year  integration  took  about  5000  such  segments,  for  a  total  time  of 
about  1000  hours  of  run  time. 

5  Conclusions 

In  hindsight,  there  are  a  few  places  where  the  architecture  could  be  improved. 
Most  serious  is  the  fact  that  there  are  no  datapaths  that  allow  results  to 
be  stored  in  code  memory,  so  that,  for  example,  all  repeat  coimts  in  the 
sequencer  must  be  fixed  at  compile  time.  Even  worse,  the  sequencer  chip 
provides  no  method  of  to  get  at  the  addresses  stored  in  its  internal  registers 

*In  hindsight,  the  use  ci  quad  precision  spears  to  have  been  overly  conservative  fat  this 
problem,  and  we  plan  to  rerun  the  computation  at  ordinary  double  precision  to  confirm 
this. 

^Instead  of  running  a  separate  Solar  System  on  each  board,  we  could  have  used  an 
alternative  parallel  decomposition  of  the  problem  into  we  body  per  board,  as  was  done 
with  the  Digital-Orrery.  -In  that  case,  -the-hand-generated-wsspinMy  code  would  have  been 
slightly  different,  because  we  would  have  to  copy  state  information  among  boards,  but  the 
bulk  ci  the  code  generation  would  have  <*till  been  done  by  the  compiler. 
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without  executing  the  instructions  at  those  addresses.  This  makes  it  very 
difiElcult  to  catch  and  process  interrupts,  e.g.,  those  caused  by  parity  errors.  A 
better  design  would  make  that  state  more  accessible.  It  would  be  convenient 
to  set  and  sense  condition  codes  based  upon  results  of  address-generator 
computations.  Also,  the  ALU  and  Multiplier  share  a  single  result  bus  which 
limits  the  opportunities  to  use  these  units  simultaneously.  But  these  are 
mostly  annoyances  rather  than  significant  problems. 

A  more  serious  limitation  of  our  prototype  Toolkit  is  that  we  have  pro¬ 
vided  only  slow  communication  with  the  host.  This  limits  applications  to 
those  that  require  very  little  communication,  such  as  the  long-term  integra¬ 
tion  of  systems  of  ordin2u:y  differential  equations.  It  is  relatively  easy  to 
improve  this  by  fabricating  a  special  board  with  connections  to  the  fast  in¬ 
terprocessor  communication  channels.  Such  a  communications  aulapter  could 
buffer  communications  to  and  from  the  host,  at  host-memory  speed.  Improv¬ 
ing  this  communication  is  a  top  priority  for  future  hardware  development. 

Otit  software-support  system  also  has  a  long  way  to  go.  While  the  com¬ 
piler  makes  it  easy  to  compile  straight-line  segments  of  code,  automating  the 
translation  of  large  algebraically-specified  systems  such  as  force  laws,  it  has 
no  concept  of  datar structure  or  of  data-dependent  conditional  jump.  Thus, 
for  example,  all  of  the  complex  control  structures  that  support  variable  step 
size  and  variable  order  integrators  (such  as  Gear’s  method  for  stiff  systems), 
and  even  the  process  of  selecting  a  pivot  in  a  linear  equation  solver,  must  cur¬ 
rently  be  painfully  constructed  in  assembly  language.  The  resulting  assembly 
code  must  be  combined  with  compiler  output  for  the  straight-line  portions  of 
the  computation.  All  multiprocessor  programs  also  must  be  painfully  glued 
together,  because,  though  we  have  made  strides  in  the  automatic  schedul¬ 
ing  of  multiple  processors  [3],  we  have  not  built  that  into  the  Toolkit  code 
generator. 

Nevertheless,  despite  its  prototype  status,  the  Toolkit  demonstrates  a 
means  practical  within  the  limits  of  current  technology,  to  provide  relatively 
inexpensive  supercomputer  performance  for  a  limited,  but  important  class 
of  problems  in  science  and  engineering.  The  key  is  to  avoid  the  generality — 
both  in  architecttire  and  in  compilation  technology — that  cause  computers 
of  comparable  speed  to  be  expensive  to  design,  build,  and  program.  The 
result  is  machines  that  are  not  as  fast  as  the  fastest  supercomputers,  but 
whose  price  advantage  permits  them  to  be  used  for  applications  that  would 
be  otherwise  infeasible. 


39 


Acknowledgements 


The  Toolkit  project  woiild  not  have  been  possible  without  continual  support 
and  encouragement  from  Joel  Bimbaum.  Henry  Wu  devised  the  board  in- 
tercoimection  strategy.  John  McGrory  built  a  prototype  host  interface  and 
debugging  software,  and  Carl  Heinzl  wrote  an  initial  set  of  functioned  diag¬ 
nostics.  Dan  Ziiras  spent  his  vacation  coding  high-precision  scientific  subrou¬ 
tines.  Rajeev  Surati  wrote  a  simulator  that  heis  become  a  regular  part  of  our 
software  development  system.  Karl  Hassur  and  Dick  Vlach  did  the  clever 
mech2uiical  design  of  the  board  and  cables.  Albert  Chun,  David  Fotland, 
Marlin  Jones,  and  John  Shelton  reviewed  the  hardware  design.  Sam  Cox, 
Robert  Grimes,  and  Bruce  Weyler  designed  the  PC  board  layout.  Darlene 
Harrell  and  Rosemary  Kingsley  provided  cheerful  project  coordination. 
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